Note for reviewer:

Thank you for looking into my project.
I have worked on it locally.
And sorry to say that the platform does not have an option to load a file without submitting it.
So I did not actually run the code at the platform.
And I will only be able to do that once the project is returned to me by you.
It should work, but I cannot promise that.

Thank you

intruduction

The purpose of this research is to give supervisors from the virtual telephony service CallMeMaybe
information on the least effective operators.

In [ ]:
 

Strategy.

The client CallMeMaybe has asked for information regarding effectiveness of operators according to:

  1. If they have a large number of missed incoming calls.
  2. Long waiting time for incoming calls.
  3. If an operator is supposed to make outgoing calls, a small number.

    I have added another criteria:
  4. Waiting duration time for outgoing calls.

For each subject an operator will receive a grate for least effective.
Meaning, the best score will be 0 - effective.
And the worst score will be 3 - least effective.

The max total of available points will be 12.

Operators will be sorted by grade is two ways:

  1. By month.
  2. Full term.

decompose.

Telecom: Identifying Ineffective Operators

The purpose of this task is to supply the virtual telephony service CallMeMaybe information regarding least effective operators.

Goals of the task:

  1. Carry out exploratory data analysis.
  2. Identify ineffective operators.
  3. Test statistical hypotheses

Decompose the Task

• Read the CSV files

Data Preprocessing

• Check columns types
• Change columns name

Carry out exploratory data analysis:

• Change columns types
• Check for duplicates
• How much data was lost ?
• Check for missing values
• How many operators are there ?
• How big is the data ?

Identify ineffective operators.

• What are the averages for operators ?
• Visualize distributions.
• Find percentiles of top and bottom operators.
• Look into distributions for clients with different tariff.
• Check amount of missed incoming calls, internal and external.
• Check long waiting time for incoming calls.
• Check if an operator is supposed to make outgoing calls, how many were made.

Test statistical hypotheses

Form hypotheses regarding the duration of the outgoing calls.
Are there differences between different days of the week ?
Are there differences between operators with different grades ?

Final conclusions.

In [ ]:
 

Presentation.

Tableau.

Setup.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display

import math
import numpy as np

from scipy import stats as st

from plotly import graph_objects as go

pd.options.display.max_columns = None

pd.options.display.max_colwidth = 200

pd.options.display.float_format = "{:,.2f}".format

Functions.

In [2]:
def df_info(df):
    '''
    show information regarding the DF.
    '''
    
    display(df.head())
    print('shape:', df.shape)
    print()
    print('******** info() ********')
    print(df.info())
    print()
    print('******** info(memory_usage) ********')
    print(df.info(memory_usage = 'deep'))
    print()
    print('******** describe() ********')
    print(df.describe())
    print()
    print('******** describe(include) ********')
    print(df.describe(include = ['object']))
In [3]:
def fix_is_missed_call(row):
    '''
    find rows with a combination of TRUE for missed call with a duration > 0.    
    and change the value into FALSE    
    '''
    is_missed_call = row['is_missed_call']
    s = str(is_missed_call)
    call_duration = row['call_duration']
    
    if (s == 'True'):
        if(call_duration > 0):
            return False
        
    return is_missed_call
In [4]:
def in_missed_calls_month(row):
    '''
    rate for in_missed_calls_month
    '''
    is_missed_call = row['is_missed_call']
    
    if is_missed_call >= 3:
        return 3
    if is_missed_call >= 2:
        return 2
    if is_missed_call >= 1:
        return 1
    return 0
In [5]:
def in_missed_calls(row):
    '''
    rate for in_missed_calls
    '''

    is_missed_call = row['is_missed_call']
    
    if is_missed_call >= 4:
        return 3
    if is_missed_call >= 3:
        return 2
    if is_missed_call >= 2:
        return 1
    return 0
In [6]:
def out_AVG_wait(row):
    '''
    rate out_AVG_wait 
    '''
    AVG_wait_duration = row['AVG_wait_duration']
    
    if AVG_wait_duration > 30:
        return 3
    if AVG_wait_duration > 25:
        return 2
    if AVG_wait_duration > 20:
        return 1
    return 0
In [7]:
def in_AVG_wait(row):
    '''
    rate in_AVG_wait
    '''
    
    wait_duration = row['AVG_wait_duration']
    
    if wait_duration > 32:
        return 3
    if wait_duration > 21:
        return 2
    if wait_duration > 15:
        return 1
    return 0
In [8]:
def out_calls_cnt_month(row):
    '''
    rate out_calls_cnt_month
    '''
    calls_count = row['calls_count']
    
    if calls_count < 10:
        return 3
    if calls_count < 25:
        return 2
    if calls_count < 50:
        return 1
    return 0
In [9]:
def out_calls_cnt(row):
    '''
    rate out_calls_cnt
    '''

    calls_count = row['calls_count']

    if calls_count < 30:
        return 3
    if calls_count < 100:
        return 2
    if calls_count < 140:
        return 1
    return 0
In [10]:
def sum_rate_month(row):
    '''
    sum up all the monthly ratings
    '''
    rate_in_missed_call_month = row['rate_in_missed_call_month']
    rate_OUT_AVG_wait_duration_month = row['rate_OUT_AVG_wait_duration_month']
    rate_in_wait_duration_month = row['rate_in_wait_duration_month']
    rate_out_calls_cnt_month = row['rate_out_calls_cnt_month']
    
    rate = rate_in_missed_call_month + rate_OUT_AVG_wait_duration_month + rate_in_wait_duration_month + rate_out_calls_cnt_month
    return rate
In [11]:
def sum_rate(row):
    '''
    sum up all the ratings
    '''
    
    rate_in_missed_call = row['rate_in_missed_call']
    rate_OUT_AVG_wait_duration = row['rate_OUT_AVG_wait_duration']
    rate_in_wait_duration = row['rate_in_wait_duration']
    rate_out_calls_cnt = row['rate_out_calls_cnt']
    
    rate = rate_in_missed_call + rate_OUT_AVG_wait_duration + rate_in_wait_duration + rate_out_calls_cnt
    return rate
In [12]:
def print_pie(df_name, labels_field, values_field, title_text):
    '''
    create a pie chart
    '''    
    
    fig = go.Figure(data=[go.Pie(labels=df_name[labels_field], 
                             values=df_name[values_field])])

    fig.update_layout(
        title= title_text,
        autosize=False,
        width=700,
        height=700,
        margin=dict(
            l=50,
            r=50,
            b=100,
            t=100,
            pad=4
        ))

    fig.show() 
In [13]:
def histogram(df, col, title, xlable, ylable):
    '''
    print a historgram
    '''
    plt.figure(figsize=(15, 5))
    plt.hist(df[col]) 

    plt.title(title)
    plt.xlabel(xlable)
    plt.ylabel(ylable)
    plt.show() 
In [14]:
def histogram_range(df, col, title, xlable, ylable, range_low, range_high):
    '''
    print a historgram with range
    '''
    plt.figure(figsize=(15, 5))
    plt.hist(df[col], range = (range_low, range_high)) 

    plt.title(title)
    plt.xlabel(xlable)
    plt.ylabel(ylable)
    plt.show() 
In [15]:
def run_hypotheses(list1, list2):
    '''
    run H0 hypotheses, check if both lists are similar.
    '''
    #the significance level is 5%
    alpha = 0.05

    results = st.ttest_ind(list1, list2)
    print('p-value: ', results.pvalue)
    if (results.pvalue < alpha):
            print("We reject the null hypothesis")
    else:
            print("We can't reject the null hypothesis")
In [16]:
def get_list_df(data1, data2, col):
    '''
    get a list from DF, with outgoig calls. 
    and run the hypotheses test.
    '''
    
    list1 = data1[(data1['direction'] == 'out') & (data1['is_missed_call'] == False)][col]
    #print(list1.head())
    
    list2 = data2[(data2['direction'] == 'out') & (data2['is_missed_call'] == False)][col]
    #print(list2.head())
    
    run_hypotheses(list1, list2)
In [ ]:
 

step 1. Explore the data, Open the data file and read the general information

Download the data and prepare it for analysis

In [17]:
clients = pd.read_csv('telecom_clients_us.csv', sep=',')
dataset = pd.read_csv('telecom_dataset_us.csv', sep=',')
In [18]:
df_info(clients)
user_id tariff_plan date_start
0 166713 A 2019-08-15
1 166901 A 2019-08-23
2 168527 A 2019-10-29
3 167097 A 2019-09-01
4 168193 A 2019-10-16
shape: (732, 3)

******** info() ********
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 732 entries, 0 to 731
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   user_id      732 non-null    int64 
 1   tariff_plan  732 non-null    object
 2   date_start   732 non-null    object
dtypes: int64(1), object(2)
memory usage: 17.3+ KB
None

******** info(memory_usage) ********
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 732 entries, 0 to 731
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   user_id      732 non-null    int64 
 1   tariff_plan  732 non-null    object
 2   date_start   732 non-null    object
dtypes: int64(1), object(2)
memory usage: 98.1 KB
None

******** describe() ********
         user_id
count     732.00
mean  167,431.93
std       633.81
min   166,373.00
25%   166,900.75
50%   167,432.00
75%   167,973.00
max   168,606.00

******** describe(include) ********
       tariff_plan  date_start
count          732         732
unique           3          73
top              C  2019-09-24
freq           395          24

Conclusions:
There are 732 rows.
There are no missing values.
tariff_plan should be a category. But using category with the platform creates problems. So will not change it.
date_start should be datetime.

In [ ]:
 
In [19]:
df_info(dataset)
user_id date direction internal operator_id is_missed_call calls_count call_duration total_call_duration
0 166377 2019-08-04 00:00:00+03:00 in False nan True 2 0 4
1 166377 2019-08-05 00:00:00+03:00 out True 880,022.00 True 3 0 5
2 166377 2019-08-05 00:00:00+03:00 out True 880,020.00 True 1 0 1
3 166377 2019-08-05 00:00:00+03:00 out True 880,020.00 False 1 10 18
4 166377 2019-08-05 00:00:00+03:00 out False 880,022.00 True 3 0 25
shape: (53902, 9)

******** info() ********
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53902 entries, 0 to 53901
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   user_id              53902 non-null  int64  
 1   date                 53902 non-null  object 
 2   direction            53902 non-null  object 
 3   internal             53785 non-null  object 
 4   operator_id          45730 non-null  float64
 5   is_missed_call       53902 non-null  bool   
 6   calls_count          53902 non-null  int64  
 7   call_duration        53902 non-null  int64  
 8   total_call_duration  53902 non-null  int64  
dtypes: bool(1), float64(1), int64(4), object(3)
memory usage: 3.3+ MB
None

******** info(memory_usage) ********
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53902 entries, 0 to 53901
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   user_id              53902 non-null  int64  
 1   date                 53902 non-null  object 
 2   direction            53902 non-null  object 
 3   internal             53785 non-null  object 
 4   operator_id          45730 non-null  float64
 5   is_missed_call       53902 non-null  bool   
 6   calls_count          53902 non-null  int64  
 7   call_duration        53902 non-null  int64  
 8   total_call_duration  53902 non-null  int64  
dtypes: bool(1), float64(1), int64(4), object(3)
memory usage: 11.1 MB
None

******** describe() ********
         user_id  operator_id  calls_count  call_duration  total_call_duration
count  53,902.00    45,730.00    53,902.00      53,902.00            53,902.00
mean  167,295.34   916,535.99        16.45         866.68             1,157.13
std       598.88    21,254.12        62.92       3,731.79             4,403.47
min   166,377.00   879,896.00         1.00           0.00                 0.00
25%   166,782.00   900,788.00         1.00           0.00                47.00
50%   167,162.00   913,938.00         4.00          38.00               210.00
75%   167,819.00   937,708.00        12.00         572.00               902.00
max   168,606.00   973,286.00     4,817.00     144,395.00           166,155.00

******** describe(include) ********
                             date direction internal
count                       53902     53902    53785
unique                        119         2        2
top     2019-11-25 00:00:00+03:00       out    False
freq                         1220     31917    47621

Conclusions:
There are 53902 rows.
There are missing values at: operator_id, internal fields.
Date should be datetime
Direction, should be a category. But using category with the platform creates problems. So will not change it.
operator_id should be int
internal should be Boolean

In [ ]:
 

Step 2.Data Preprocessing.

2.1 Rename the columns.

no need to change names

In [ ]:
 

2.2 convert dates to datetime.

In [20]:
clients['date_start'] = pd.to_datetime(clients['date_start'], format = '%Y-%m-%d')

find timestamp range

In [21]:
time_max = clients['date_start'].max()
time_max
Out[21]:
Timestamp('2019-10-31 00:00:00')
In [22]:
time_min = clients['date_start'].min()
time_min
Out[22]:
Timestamp('2019-08-01 00:00:00')
In [ ]:
 

Will create columns for: month, day, day of week.

In [23]:
clients['month_date_start'] = clients['date_start'].dt.month
In [24]:
clients['day_date_start'] = clients['date_start'].dt.day
In [25]:
clients['dayofweek_date_start'] = clients['date_start'].dt.dayofweek
In [26]:
clients.head()
Out[26]:
user_id tariff_plan date_start month_date_start day_date_start dayofweek_date_start
0 166713 A 2019-08-15 8 15 3
1 166901 A 2019-08-23 8 23 4
2 168527 A 2019-10-29 10 29 1
3 167097 A 2019-09-01 9 1 6
4 168193 A 2019-10-16 10 16 2
In [ ]:
 

remove time and zone from date

In [27]:
dataset['date'] = dataset['date'].map(lambda x: x[0:10] )

convert to datetime

In [28]:
dataset['date'] = pd.to_datetime(dataset['date'], format = '%Y-%m-%d')

find timestamp range

In [29]:
time_max = dataset['date'].max()
time_max
Out[29]:
Timestamp('2019-11-28 00:00:00')
In [30]:
time_min = dataset['date'].min()
time_min
Out[30]:
Timestamp('2019-08-02 00:00:00')
In [ ]:
 

Conclusions:
The date range of the research is from: 02-Aug-2019 to 11-Nov-2019

In [ ]:
 

Will create columns for: month, day, day of week.

In [31]:
dataset['month_date'] = dataset['date'].dt.month
In [32]:
dataset['day_date'] = dataset['date'].dt.day
In [33]:
dataset['dayofweek_date'] = dataset['date'].dt.dayofweek
In [ ]:
 

2.3 convert operator_id to int.

in order to convert, we will need to take care of missing values first.

2.4 Look into missing values

In [34]:
dataset[dataset['operator_id'].isnull()].head()
Out[34]:
user_id date direction internal operator_id is_missed_call calls_count call_duration total_call_duration month_date day_date dayofweek_date
0 166377 2019-08-04 in False nan True 2 0 4 8 4 6
7 166377 2019-08-05 in False nan True 6 0 35 8 5 0
9 166377 2019-08-06 in False nan True 4 0 62 8 6 1
17 166377 2019-08-07 in False nan True 2 0 24 8 7 2
27 166377 2019-08-12 in False nan True 2 0 34 8 12 0

Conclusions:
since operator ID is missing, we cannot use those rows.
They will be deleted.

In [35]:
dataset.shape
Out[35]:
(53902, 12)
In [36]:
dataset = dataset[~dataset['operator_id'].isnull()]
In [37]:
dataset.shape
Out[37]:
(45730, 12)
In [38]:
dataset[dataset['operator_id'].isnull()]
Out[38]:
user_id date direction internal operator_id is_missed_call calls_count call_duration total_call_duration month_date day_date dayofweek_date

8172 rows were deleted, there are no more mising values at operator_id.

In [ ]:
 

convert operator_id to int.

In [39]:
dataset['operator_id'] = dataset['operator_id'].astype(int)
In [ ]:
 

STEP 3: Exploratory data analysis:

In [ ]:
 

3.1 Are there any duplicated values? If so, what's their nature?

In [40]:
clients.duplicated().sum()
Out[40]:
0
In [41]:
dataset.duplicated().sum()
Out[41]:
4184
In [42]:
dataset[dataset.duplicated()].head()
Out[42]:
user_id date direction internal operator_id is_missed_call calls_count call_duration total_call_duration month_date day_date dayofweek_date
8 166377 2019-08-05 out False 880020 True 8 0 50 8 5 0
44 166377 2019-08-14 out False 880026 False 10 1567 1654 8 14 2
51 166377 2019-08-15 out False 880026 False 11 1413 1473 8 15 3
62 166377 2019-08-19 out False 880026 False 14 1519 1598 8 19 0
78 166377 2019-08-22 out False 880026 True 6 0 55 8 22 3

Conclusions:
There are 4184 duplicated rows.

In [ ]:
 

Check the distribution of different values at the columns

In [43]:
for i in dataset[dataset.duplicated()].columns:
    print(i, ':', dataset[dataset.duplicated()][i].nunique())
user_id : 240
date : 114
direction : 2
internal : 2
operator_id : 776
is_missed_call : 2
calls_count : 170
call_duration : 1543
total_call_duration : 1787
month_date : 4
day_date : 31
dayofweek_date : 7

there are duplicated values at all of the columns.

lets look at the WHEN

In [44]:
dataset[dataset.duplicated()]['date'].dt.date.nunique()
Out[44]:
114

there are 114 diferant dates.

In [ ]:
 
In [45]:
histogram(dataset[dataset.duplicated()],\
          'dayofweek_date', 'Distribution of duplicates by days of the week', 'days of the week', 'amount')
In [ ]:
 

let's look at how calls are distributed generally

In [46]:
histogram(dataset, 'dayofweek_date', 'Distribution by days of the week', 'days of the week', 'amount')
In [ ]:
 

Conclusions:
There is no apparent pattern to the duplicate rows.
The distributions for duplicated rows are similar to that of the general distribution.
This issue should be reported to the relevant department.
For now those rows will be deleted

In [47]:
dataset.drop_duplicates(inplace=True)
In [ ]:
 

3.2 Check for missing values

In [48]:
dataset.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 41546 entries, 1 to 53899
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   user_id              41546 non-null  int64         
 1   date                 41546 non-null  datetime64[ns]
 2   direction            41546 non-null  object        
 3   internal             41491 non-null  object        
 4   operator_id          41546 non-null  int32         
 5   is_missed_call       41546 non-null  bool          
 6   calls_count          41546 non-null  int64         
 7   call_duration        41546 non-null  int64         
 8   total_call_duration  41546 non-null  int64         
 9   month_date           41546 non-null  int64         
 10  day_date             41546 non-null  int64         
 11  dayofweek_date       41546 non-null  int64         
dtypes: bool(1), datetime64[ns](1), int32(1), int64(7), object(2)
memory usage: 3.7+ MB

there are missing values at internal,
lets look at those rows.

In [49]:
dataset[dataset['internal'].isnull()]['user_id'].count()
Out[49]:
55

there are 55 rows with missing value at: internal

In [50]:
dataset[dataset['internal'].isnull()].head()
Out[50]:
user_id date direction internal operator_id is_missed_call calls_count call_duration total_call_duration month_date day_date dayofweek_date
1924 166406 2019-09-02 in NaN 879898 False 1 2 9 9 2 0
6210 166541 2019-09-26 in NaN 908960 False 1 393 423 9 26 3
6216 166541 2019-09-26 in NaN 908958 False 2 547 612 9 26 3
7528 166604 2019-11-01 in NaN 893402 False 1 94 117 11 1 4
8650 166658 2019-09-24 in NaN 890404 False 1 150 157 9 24 1
In [51]:
dataset[dataset['internal'].isnull()].shape
Out[51]:
(55, 12)

Lets look at the general distribution between internal and external calls

In [52]:
dataset['internal'].unique()
Out[52]:
array([True, False, nan], dtype=object)
In [53]:
dataset[dataset['internal'] == True]['user_id'].count()
Out[53]:
5330
In [54]:
dataset[dataset['internal'] == False]['user_id'].count()
Out[54]:
36161

Since the majority of the calls are external (internal = False),
and there are only 55 rows with missing value, we will write them as False as well.

In [55]:
dataset['internal'] = dataset['internal'].fillna(False)
In [ ]:
 

3.3 How much data was lost ?

In [56]:
dataset.shape
Out[56]:
(41546, 12)
In [57]:
dataset.info(memory_usage = 'deep')  
<class 'pandas.core.frame.DataFrame'>
Int64Index: 41546 entries, 1 to 53899
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   user_id              41546 non-null  int64         
 1   date                 41546 non-null  datetime64[ns]
 2   direction            41546 non-null  object        
 3   internal             41546 non-null  bool          
 4   operator_id          41546 non-null  int32         
 5   is_missed_call       41546 non-null  bool          
 6   calls_count          41546 non-null  int64         
 7   call_duration        41546 non-null  int64         
 8   total_call_duration  41546 non-null  int64         
 9   month_date           41546 non-null  int64         
 10  day_date             41546 non-null  int64         
 11  dayofweek_date       41546 non-null  int64         
dtypes: bool(2), datetime64[ns](1), int32(1), int64(7), object(1)
memory usage: 5.5 MB

Conclusions:
Initially there were 53902 rows.
There are no missing values.
After removal of duplicated rows there are now 41546 rows.
all togther 12356 rows were removed.

In [ ]:
 

3.4 How many operators are there ?

In [58]:
dataset['operator_id'].nunique()
Out[58]:
1092

there are 1092 operators

In [ ]:
 

3.5 How big is the data ?

the size of the dara set has changed from: 11.1 MB
to 5.5 MB

In [ ]:
 

3.6 Repair data

there are some error in the date, such as:

In [59]:
dataset[(dataset['user_id'] == 166405) & (dataset['is_missed_call'] == True) & (dataset['direction'] == 'in')]
Out[59]:
user_id date direction internal operator_id is_missed_call calls_count call_duration total_call_duration month_date day_date dayofweek_date
761 166405 2019-08-19 in False 882684 True 1 0 7 8 19 0
762 166405 2019-08-19 in False 882686 True 1 0 5 8 19 0
791 166405 2019-08-24 in False 882684 True 1 0 14 8 24 5
812 166405 2019-08-28 in False 882686 True 2 0 50 8 28 2
929 166405 2019-09-09 in False 882686 True 1 0 8 9 9 0
941 166405 2019-09-11 in False 882684 True 1 0 29 9 11 2
1062 166405 2019-09-27 in False 882684 True 1 0 16 9 27 4
1106 166405 2019-10-02 in False 902532 True 1 0 111 10 2 2
1264 166405 2019-10-21 in False 882690 True 1 0 11 10 21 0
1323 166405 2019-10-25 in False 902510 True 1 0 12 10 25 4
1339 166405 2019-10-28 in False 882690 True 1 0 109 10 28 0
1380 166405 2019-10-30 in False 902510 True 1 0 54 10 30 2
1606 166405 2019-11-19 in False 939478 True 1 165 173 11 19 1
1611 166405 2019-11-20 in False 939476 True 1 0 5 11 20 2
1635 166405 2019-11-21 in False 882686 True 1 1 5 11 21 3
1657 166405 2019-11-22 in False 882686 True 1 1 5 11 22 4

row 1606 is marked as a missed call, yet it has duration.
the developers need to be informed of this.
for now will change such lines to the value of FALSE at is_missed_call

In [ ]:
 
In [60]:
dataset['is_missed_call'] = dataset.apply(fix_is_missed_call, axis = 1)
In [ ]:
 

create a list of months from the dataset

In [61]:
months = dataset['month_date'].unique()
months
Out[61]:
array([ 8,  9, 10, 11], dtype=int64)
In [ ]:
 

Step 4: Identify ineffective operators.

In [ ]:
 

4.1 What are the averages for operators ?

Calculate the averages and grade operators for:
number of missed incoming calls.
waiting time for incoming calls.
amount of out going calls.
wait duration of out going calls.

4.2 calculate waiting duration

In [62]:
dataset['wait_duration'] = dataset['total_call_duration'] - dataset['call_duration']

what was the average duration of a call

In [63]:
dataset['AVG_wait_duration'] = (dataset['wait_duration'] / dataset['calls_count']).astype(int)
In [ ]:
 

join with clients

In [64]:
dataset = pd.merge(dataset, clients, left_on = 'user_id', right_on = 'user_id', how = 'left', sort = False)
dataset.head()
Out[64]:
user_id date direction internal operator_id is_missed_call calls_count call_duration total_call_duration month_date day_date dayofweek_date wait_duration AVG_wait_duration tariff_plan date_start month_date_start day_date_start dayofweek_date_start
0 166377 2019-08-05 out True 880022 True 3 0 5 8 5 0 5 1 B 2019-08-01 8 1 3
1 166377 2019-08-05 out True 880020 True 1 0 1 8 5 0 1 1 B 2019-08-01 8 1 3
2 166377 2019-08-05 out True 880020 False 1 10 18 8 5 0 8 8 B 2019-08-01 8 1 3
3 166377 2019-08-05 out False 880022 True 3 0 25 8 5 0 25 8 B 2019-08-01 8 1 3
4 166377 2019-08-05 out False 880020 False 2 3 29 8 5 0 26 13 B 2019-08-01 8 1 3
In [ ]:
 

split the dataset to: in and out

In [65]:
dataset_in = dataset[dataset['direction'] == 'in']
In [66]:
dataset_out = dataset[dataset['direction'] == 'out']

4.3 number of missed incoming calls

An operator is considered ineffective if they have a large number of missed incoming calls.

In [ ]:
 
In [67]:
dataset_in.head()
Out[67]:
user_id date direction internal operator_id is_missed_call calls_count call_duration total_call_duration month_date day_date dayofweek_date wait_duration AVG_wait_duration tariff_plan date_start month_date_start day_date_start dayofweek_date_start
21 166377 2019-08-12 in False 880028 False 1 407 411 8 12 0 4 4 B 2019-08-01 8 1 3
26 166377 2019-08-13 in False 880028 False 1 88 102 8 13 1 14 14 B 2019-08-01 8 1 3
29 166377 2019-08-14 in False 880026 False 2 197 218 8 14 2 21 10 B 2019-08-01 8 1 3
30 166377 2019-08-14 in False 880028 False 1 33 37 8 14 2 4 4 B 2019-08-01 8 1 3
38 166377 2019-08-15 in False 880028 False 1 23 27 8 15 3 4 4 B 2019-08-01 8 1 3
In [68]:
dataset_in.shape
Out[68]:
(12733, 19)
In [ ]:
 

find out how many missed calls each operator has by month

In [69]:
dataset_in_missed_calls_month = dataset_in \
.pivot_table(index = ('operator_id', 'month_date'), values = 'is_missed_call', aggfunc = 'sum').reset_index()
dataset_in_missed_calls_month.head()
Out[69]:
operator_id month_date is_missed_call
0 879896 8 0.00
1 879896 9 0.00
2 879896 10 0.00
3 879898 8 0.00
4 879898 9 0.00

find out how many missed calls each operator has by full term

In [70]:
dataset_in_missed_calls = dataset_in \
.pivot_table(index = ('operator_id'), values = 'is_missed_call', aggfunc = 'sum').reset_index()
dataset_in_missed_calls.head()
Out[70]:
operator_id is_missed_call
0 879896 0.00
1 879898 0.00
2 880020 0.00
3 880022 0.00
4 880026 0.00
In [ ]:
 

4.4 Visualize distribution of IN missed calls.

by month

In [71]:
for month in months:
    data = dataset_in_missed_calls_month[dataset_in_missed_calls_month['month_date'] == month]
    label = ('Distribution of missed IN calls for month ' + str(month))
    histogram(data, 'is_missed_call', label, 'missed calls', 'operators')
    print('the percentile for 10, 5, 1 percent are:')
    print(np.percentile(data['is_missed_call'], [90, 95, 99])) 
the percentile for 10, 5, 1 percent are:
[0.   0.   2.96]
the percentile for 10, 5, 1 percent are:
[1. 1. 3.]
the percentile for 10, 5, 1 percent are:
[1. 2. 4.]
the percentile for 10, 5, 1 percent are:
[1.   2.   6.45]

for the full term

In [72]:
    label = ('Distribution of missed calls for full term')
    histogram(dataset_in_missed_calls, 'is_missed_call', label, 'missed calls', 'operators')
    print('the percentile for 15, 10, 5, 1 percent are:')
    print(np.percentile(dataset_in_missed_calls['is_missed_call'], [90, 95, 99])) 
the percentile for 15, 10, 5, 1 percent are:
[ 2.  3. 10.]
In [ ]:
 

Conclusions:
only 10% have more then 1 missed calls at the full term.
and most operators have up to one missed call in a month.
for monthly:
up to 3 will be level 3.
up to 2 will be level 2.
up to 1 will be level 1.

for full time:
up to 4 will be level 3.
up to 3 will be level 2.
up to 2 will be level 1.

In [ ]:
 

rate the operators:

In [73]:
dataset_in_missed_calls_month['rate_in_missed_call_month'] = dataset_in_missed_calls_month.apply(in_missed_calls, axis = 1 )
In [74]:
dataset_in_missed_calls['rate_in_missed_call'] = dataset_in_missed_calls.apply(in_missed_calls, axis = 1 )
In [ ]:
 
In [ ]:
 

4.5 wait duration of out going calls.

In [ ]:
 
In [75]:
label = 'general distribution AVG wait duration of outgoing calls'
histogram(dataset_out, 'AVG_wait_duration', label, 'wait duration[sec]', 'operators')
In [76]:
dataset_out_referance_AVG_wait = dataset_out[dataset_out['AVG_wait_duration'] < 80]
dataset_out_referance_AVG_wait.boxplot(column=['AVG_wait_duration'])
plt.title('AVG_wait_duration')
plt.show()
In [ ]:
 
In [ ]:
 

Find out what is the average wait duration for outgoing calls by month

In [77]:
dataset_out_AVG_wait_month = dataset_out_referance_AVG_wait \
.pivot_table(index = ('operator_id', 'month_date'), values = 'AVG_wait_duration', aggfunc = 'mean').reset_index()
dataset_out_AVG_wait_month.head()
Out[77]:
operator_id month_date AVG_wait_duration
0 879896 8 19.12
1 879896 9 13.71
2 879896 10 8.31
3 879896 11 9.50
4 879898 8 14.43

Find out what is the average wait duration for outgoing calls full term

In [78]:
dataset_out_AVG_wait = dataset_out_referance_AVG_wait. \
pivot_table(index = ('operator_id'), values = 'AVG_wait_duration', aggfunc = 'mean').reset_index()
dataset_out_AVG_wait.head()
Out[78]:
operator_id AVG_wait_duration
0 879896 15.68
1 879898 14.52
2 880020 4.85
3 880022 13.74
4 880026 12.05

4.6 Visualize distribution of OUT average wait duration.

by month

In [79]:
for month in months:
    data = dataset_out_AVG_wait_month[dataset_out_AVG_wait_month['month_date'] == month]
    label = ('Distribution of wait duration of OUT calls for month ' + str(month))
    histogram(data, 'AVG_wait_duration', label, 'wait duration[sec]', 'operators')
    print('the percentile for 10, 5, 1 percent are:')
    print(np.percentile(data['AVG_wait_duration'], [90, 95, 99])) 
the percentile for 10, 5, 1 percent are:
[23.15       24.89583333 31.94333333]
the percentile for 10, 5, 1 percent are:
[23.17857143 25.19818182 40.02      ]
the percentile for 10, 5, 1 percent are:
[22.552      24.83898305 32.18527273]
the percentile for 10, 5, 1 percent are:
[22.28655914 24.74802632 31.931     ]
In [ ]:
 

full term

In [80]:
histogram(dataset_out_AVG_wait, 'AVG_wait_duration', 'Distribution of wait duration of OUT calls for full term', \
          'wait duration[sec]', 'operators')
print('the percentile for 10, 5, 1 percent are:')
print(np.percentile(dataset_out_AVG_wait['AVG_wait_duration'], [90, 95, 99])) 
the percentile for 10, 5, 1 percent are:
[22.19428571 24.10555556 32.595     ]
In [81]:
dataset_out_AVG_wait.boxplot(column=['AVG_wait_duration'])
plt.title('AVG wait duration')
plt.show()
In [82]:
print(np.percentile(dataset_out_AVG_wait['AVG_wait_duration'], [90, 95, 99])) 
[22.19428571 24.10555556 32.595     ]

duration of more them 40 seconds is an outlayer

In [83]:
dataset_out_AVG_wait[dataset_out_AVG_wait['AVG_wait_duration'] < 40]['AVG_wait_duration'].describe()
Out[83]:
count   879.00
mean     14.86
std       6.17
min       0.00
25%      10.70
50%      15.10
75%      19.00
max      38.67
Name: AVG_wait_duration, dtype: float64

conclusion:
only 10% have of operator have a wait duration of more then 22 seconds.
with the average being 15 seconds.
and with 75% of operators wait up to 19 seconds.
only 5% wait longer then 25 seconds. we should consider a wait duration of more then 20 seconds as long-1.
and above 25 as long-2
and above 30 as long-3

In [ ]:
 

rate the operators:

In [ ]:
 
In [84]:
dataset_out_AVG_wait_month['rate_OUT_AVG_wait_duration_month'] = dataset_out_AVG_wait_month.apply(out_AVG_wait, axis = 1 )
In [85]:
dataset_out_AVG_wait['rate_OUT_AVG_wait_duration'] = dataset_out_AVG_wait.apply(out_AVG_wait, axis = 1 )
In [ ]:
 
In [ ]:
 

4.7 waiting time for incoming calls.

An operator is considered ineffective if they have a long waiting time for incoming calls.

Find out what is the average wait duration for incomin calls by month

In [86]:
dataset_in_AVG_wait_month = dataset_in \
.pivot_table(index = ('operator_id', 'month_date'), values = 'AVG_wait_duration', aggfunc = 'mean').reset_index()
dataset_in_AVG_wait_month.head()
Out[86]:
operator_id month_date AVG_wait_duration
0 879896 8 9.92
1 879896 9 17.83
2 879896 10 20.50
3 879898 8 11.00
4 879898 9 14.12

Find out what is the average wait duration for incomin calls full term

In [87]:
dataset_in.head()
Out[87]:
user_id date direction internal operator_id is_missed_call calls_count call_duration total_call_duration month_date day_date dayofweek_date wait_duration AVG_wait_duration tariff_plan date_start month_date_start day_date_start dayofweek_date_start
21 166377 2019-08-12 in False 880028 False 1 407 411 8 12 0 4 4 B 2019-08-01 8 1 3
26 166377 2019-08-13 in False 880028 False 1 88 102 8 13 1 14 14 B 2019-08-01 8 1 3
29 166377 2019-08-14 in False 880026 False 2 197 218 8 14 2 21 10 B 2019-08-01 8 1 3
30 166377 2019-08-14 in False 880028 False 1 33 37 8 14 2 4 4 B 2019-08-01 8 1 3
38 166377 2019-08-15 in False 880028 False 1 23 27 8 15 3 4 4 B 2019-08-01 8 1 3
In [88]:
dataset_in_AVG_wait = dataset_in \
.pivot_table(index = 'operator_id', values = 'AVG_wait_duration', aggfunc = 'mean').reset_index()
dataset_in_AVG_wait.head()
Out[88]:
operator_id AVG_wait_duration
0 879896 13.19
1 879898 13.91
2 880020 7.71
3 880022 14.00
4 880026 5.89

4.8 Visualize distribution of IN average wait duration.

by months

In [89]:
for month in months:
    data = dataset_in_AVG_wait_month[dataset_in_AVG_wait_month['month_date'] == month]
    label = ('Distribution of AVG wait duration of IN calls for month ' + str(month))
    histogram(data, 'AVG_wait_duration', label, 'AVG wait duration[sec]', 'operators')
    print('the percentile for 10, 5, 1 percent are:')
    print(np.percentile(data['AVG_wait_duration'], [90, 95, 99]))
the percentile for 10, 5, 1 percent are:
[32.2        34.63333333 42.        ]
the percentile for 10, 5, 1 percent are:
[34.2        40.225      55.69166667]
the percentile for 10, 5, 1 percent are:
[33.212 40.3   56.635]
the percentile for 10, 5, 1 percent are:
[34.34848485 44.         57.75      ]

full term

In [90]:
histogram(dataset_in_AVG_wait, 'AVG_wait_duration', \
          'Distribution of AVG wait duration of IN calls for full term', 'wait duration[sec]', 'operators')
print('the percentile for 10, 5, 1 percent are:')
print(np.percentile(dataset_in_AVG_wait['AVG_wait_duration'], [90, 95, 99]))
the percentile for 10, 5, 1 percent are:
[33.57527473 42.06176471 57.205     ]
In [ ]:
 

we will considere above 60 seconds of wait to be an outlayer.

In [91]:
histogram_range(dataset_in_AVG_wait, 'AVG_wait_duration', \
                'Distribution of AVG wait duration of IN calls for full term', 'wait duration[sec]', 'operators', 0, 60)
In [92]:
dataset_in_AVG_wait[dataset_in_AVG_wait['AVG_wait_duration'] <= 60].boxplot(column=['AVG_wait_duration'])
plt.title('AVG wait duration')
plt.show()
In [93]:
print(np.percentile(dataset_in[dataset_in['AVG_wait_duration'] <= 60]['AVG_wait_duration'], [90, 95, 99])) 
[31. 38. 51.]
In [94]:
dataset_in[dataset_in['AVG_wait_duration'] <= 60]['AVG_wait_duration'].describe()
Out[94]:
count   12,581.00
mean        15.41
std         10.91
min          0.00
25%          7.00
50%         13.00
75%         21.00
max         60.00
Name: AVG_wait_duration, dtype: float64

Conclusions:
Only 10% have a wait duration for incoming calls of more than 32 seconds.
With 75 % wait less than 21 seconds
With the average at 13 seconds
after also looking at the distributions:
above 15 will be long 1.
above 21 will be long 2.
above 32 will be long 3.

In [ ]:
 

rate the operators:

In [ ]:
 
In [95]:
dataset_in_AVG_wait_month['rate_in_wait_duration_month'] = dataset_in_AVG_wait_month.apply(in_AVG_wait, axis = 1 )
In [96]:
dataset_in_AVG_wait['rate_in_wait_duration'] = dataset_in_AVG_wait.apply(in_AVG_wait, axis = 1 )
In [ ]:
 
In [ ]:
 

4.9 amount of out going calls.

In [ ]:
 

if an operator is supposed to make outgoing calls, a small number of them is also a sign of ineffectiveness.

find out how many out calls were made

by month

In [97]:
dataset_out_calls_cnt_month = dataset_out \
.pivot_table(index = ('operator_id', 'month_date'), values = 'calls_count', aggfunc = 'sum').reset_index()
dataset_out_calls_cnt_month.head()
Out[97]:
operator_id month_date calls_count
0 879896 8 697
1 879896 9 100
2 879896 10 66
3 879896 11 9
4 879898 8 1655

full term

In [98]:
dataset_out_calls_cnt = dataset_out \
.pivot_table(index = ('operator_id'), values = 'calls_count', aggfunc = 'sum').reset_index()
dataset_out_calls_cnt.head()
Out[98]:
operator_id calls_count
0 879896 872
1 879898 7209
2 880020 38
3 880022 189
4 880026 2208

4.10 Visualize distribution of OUT calls.

per month

In [99]:
for month in months:
    data = dataset_out_calls_cnt_month[dataset_out_calls_cnt_month['month_date'] == month]
    label = ('Distribution of OUT going calls for month ' + str(month))
    histogram(data, 'calls_count', label, 'out going calls', 'operators')
    print('the percentile for 10, 5, 1 percent are:')
    print(np.percentile(data['calls_count'], [90, 95, 99]))
the percentile for 10, 5, 1 percent are:
[ 272.5   661.25 7139.64]
the percentile for 10, 5, 1 percent are:
[ 604.6   796.1  1897.62]
the percentile for 10, 5, 1 percent are:
[ 989.2  1340.65 3795.06]
the percentile for 10, 5, 1 percent are:
[ 957.4  1536.8  2798.05]

since we are looking for inafective, we will look at operators with a small amount of out calls.

In [ ]:
 
In [ ]:
 
In [100]:
for month in months:
    data = dataset_out_calls_cnt_month[dataset_out_calls_cnt_month['month_date'] == month]
    label = ('Distribution of OUT going calls for month ' + str(month))
    histogram_range(data, 'calls_count', label, 'out going calls', 'operators', 0, 200)
    print('the percentile for 10, 5, 1 percent are:')
    print(np.percentile(dataset_out_calls_cnt_month[(dataset_out_calls_cnt_month['calls_count'] <= 200) \
                                                    & (dataset_out_calls_cnt_month['month_date'] == month)]\
                        ['calls_count'], [10, 5, 1])) 
the percentile for 10, 5, 1 percent are:
[1. 1. 1.]
the percentile for 10, 5, 1 percent are:
[1. 1. 1.]
the percentile for 10, 5, 1 percent are:
[1. 1. 1.]
the percentile for 10, 5, 1 percent are:
[1. 1. 1.]

the precintail are very low, so accorind to the destbution lower then 20 is low-1 and lower then 10 is low-2

full term

In [101]:
histogram(dataset_out_calls_cnt, 'calls_count', 'Distribution of OUT going calls for full term', 'calls count', 'operators')

we will look at the range of up to 500 calls.

In [102]:
histogram_range\
(dataset_out_calls_cnt, 'calls_count', 'Distribution of OUT going calls for full term', 'calls count', 'operators', 0, 500)
In [103]:
dataset_out_calls_cnt[dataset_out_calls_cnt['calls_count'] < 500].boxplot(column=['calls_count'])
plt.title('calls out count')
plt.show()
In [104]:
dataset_out_calls_cnt[dataset_out_calls_cnt['calls_count'] < 500]['calls_count'].describe()
Out[104]:
count   646.00
mean     90.90
std     120.53
min       1.00
25%       7.00
50%      30.50
75%     138.00
max     491.00
Name: calls_count, dtype: float64
In [105]:
print(np.percentile(dataset_out_calls_cnt[dataset_out_calls_cnt['calls_count'] < 500]['calls_count'], [10, 5, 1])) 
[2. 1. 1.]

low-1 will be below 30, low-2 will be below 15

In [ ]:
 

Conclusions:
the precintail are very low, so according to the distribution:
for monthly:
below 50 will be 1
below 25 will be 2
below 10 will be 3

for full term:
below 140 will be 1
below 100 will be 2
below 30 will be 1

In [ ]:
 

rate the operators:

In [106]:
dataset_out_calls_cnt_month['rate_out_calls_cnt_month'] = dataset_out_calls_cnt_month.apply(out_calls_cnt_month, axis = 1)
In [107]:
dataset_out_calls_cnt['rate_out_calls_cnt'] = dataset_out_calls_cnt.apply(out_calls_cnt, axis = 1)
In [ ]:
 
In [ ]:
 

4.11 calculations for month

merge all the results togther

merge: dataset_in_missed_calls_month with dataset_out_AVG_wait_month

In [ ]:
 
In [108]:
rate_data_month = pd.merge(dataset_in_missed_calls_month[['operator_id', 'month_date', 'rate_in_missed_call_month']],\
                           dataset_out_AVG_wait_month[['operator_id','month_date', 'rate_OUT_AVG_wait_duration_month']],\
                           on=('operator_id', 'month_date'), how='left')
#rate_data_month.head()

merge: dataset_in_AVG_wait_duration_month

In [109]:
rate_data_month = pd.merge(rate_data_month, dataset_in_AVG_wait_month \
                           [['operator_id','month_date', 'rate_in_wait_duration_month']],\
                           on=('operator_id', 'month_date'), how='left')
#rate_data_month.head()
In [ ]:
 

merge with: dataset_out_calls_cnt_month

In [110]:
rate_data_month = pd.merge(rate_data_month, dataset_out_calls_cnt_month \
                           [['operator_id','month_date', 'rate_out_calls_cnt_month']], \
                           on=('operator_id', 'month_date'), how='left')
#rate_data_month.head()

replace null with 0

In [111]:
rate_data_month = rate_data_month.fillna(0)

sum up the different ratings:

In [112]:
rate_data_month['rate_month'] = rate_data_month.apply(sum_rate_month, axis = 1)
#rate_data_month.head()

cast values to int

In [113]:
rate_data_month['rate_month'] = rate_data_month['rate_month'].astype(int)
In [114]:
rate_data_month.sort_values(by='rate_month')[['operator_id','month_date', 'rate_month']].tail(10)
Out[114]:
operator_id month_date rate_month
1266 944218 10 7
1273 944226 11 7
186 892530 9 8
1264 944216 10 8
863 921306 9 8
188 892530 11 9
626 908834 9 9
1268 944220 10 9
343 900194 11 9
117 890226 9 9
In [ ]:
 

show the grade for each operator by month

In [115]:
rate_month = rate_data_month.pivot_table\
(index = 'operator_id', columns = 'month_date', values = 'rate_month', aggfunc = 'sum').fillna(0).reset_index()
rate_month = rate_month.astype(int)
rate_month.head()
Out[115]:
month_date operator_id 8 9 10 11
0 879896 0 1 1 0
1 879898 0 0 1 0
2 880020 3 2 0 0
3 880022 0 0 3 1
4 880026 0 0 0 0
In [ ]:
 
In [ ]:
 

4.12 calculations for full time

merge: dataset_in_missed_calls with dataset_out_AVG_wait

In [116]:
rate_data = pd.merge(dataset_in_missed_calls[['operator_id', 'rate_in_missed_call']],\
                     dataset_out_AVG_wait[['operator_id', 'rate_OUT_AVG_wait_duration']],\
                     on=('operator_id'), how='left')
#rate_data.head()

merge: dataset_in_AVG_wait_duration

In [117]:
rate_data = pd.merge(rate_data, dataset_in_AVG_wait[['operator_id', 'rate_in_wait_duration']],\
                     on=('operator_id'), how='left')
#rate_data.head()

merge with: dataset_out_calls_cnt_month

In [118]:
rate_data = pd.merge(rate_data, dataset_out_calls_cnt[['operator_id', 'rate_out_calls_cnt']],\
                     on=('operator_id'), how='left')
#rate_data.head()
In [ ]:
 
In [119]:
rate_data = rate_data.fillna(0)

sum up all the different ratings:

In [120]:
rate_data['rate'] = rate_data.apply(sum_rate, axis = 1)
#rate_data.head()
In [121]:
rate_data['rate'] = rate_data['rate'].astype(int)
In [ ]:
 
In [122]:
rate_data[['operator_id', 'rate']].head()
Out[122]:
operator_id rate
0 879896 0
1 879898 0
2 880020 2
3 880022 0
4 880026 0
In [ ]:
 

4.13 results

how long will the operator list be ?

In [123]:
#user can modify this value, up to 50.
operators_on_list_month = 10
operators_on_list_full_term = 30
In [ ]:
 

4.14 rate results by month

In [124]:
for col in rate_month.columns: 
    if col == 'operator_id':
        continue   
    label = 'Distribution of least effective operators for month ' + str(col)

    histogram(rate_month, col, label, 'least effective grade', 'operators')

    data = rate_month[['operator_id', col]].sort_values(by = col, ascending = False)
    print(data.head(operators_on_list_month))
    
month_date  operator_id  8
16               884408  6
68               892534  6
31               888406  6
72               893172  5
28               887282  5
67               892532  5
7                882478  5
22               885890  4
70               892538  4
117              898920  4
month_date  operator_id  9
251              908834  9
43               890226  9
357              921306  8
66               892530  8
140              900792  6
36               888538  6
337              919794  6
68               892534  6
334              919554  6
283              913474  6
month_date  operator_id  10
596              944220   9
594              944216   8
63               891946   7
595              944218   7
128              900194   7
320              918956   7
334              919554   6
598              944226   6
328              919204   6
572              940610   6
month_date  operator_id  11
66               892530   9
128              900194   9
75               893420   7
504              937750   7
596              944220   7
428              929884   7
598              944226   7
494              937368   6
485              935956   6
673              952916   6
In [ ]:
 

4.15 rate results for full term

In [125]:
label = 'Distribution of score of least effective operators for full term'

histogram(rate_data, 'rate', label, 'least effective grade', 'operators')

data = rate_data[['operator_id', 'rate']].sort_values(by = 'rate', ascending = False)
print(data.head(operators_on_list_full_term))
     operator_id  rate
251       908834     9
66        892530     9
320       918956     8
598       944226     7
654       951290     7
128       900194     7
596       944220     7
594       944216     7
597       944222     6
321       918958     6
572       940610     6
328       919204     6
43        890226     6
329       919206     6
334       919554     6
335       919790     6
662       951650     6
351       920902     6
215       906402     6
595       944218     6
68        892534     6
69        892536     6
309       917252     6
75        893420     6
631       947816     6
428       929884     6
207       906076     6
362       922114     6
369       923254     6
600       944246     6
In [ ]:
 
In [ ]:
 
In [ ]:
 

Step 5: Look into distributions for clients with different tariff.

In [ ]:
 
In [126]:
group_tariff = clients.groupby('tariff_plan').agg({'user_id':'count'}).reset_index()
group_tariff
Out[126]:
tariff_plan user_id
0 A 76
1 B 261
2 C 395
In [127]:
title_text = 'proportions of users by tariff plan'
print_pie(group_tariff, 'tariff_plan', 'user_id', title_text)

Conclusion:
It seems that taarif A is the most expansive.
Then B.
And taarif C is the least expensive.

In [128]:
dataset_out_referance_AVG_wait.pivot_table(index = 'tariff_plan', values = 'wait_duration', aggfunc = ('mean', 'median'))
Out[128]:
mean median
tariff_plan
A 837.95 142.00
B 180.54 67.00
C 254.15 72.00

Conclusion:
it seems that the operators will wait longer for someone to pick up on the other side with calls for client A.

In [129]:
dataset_in[dataset_in['is_missed_call'] == True]\
.pivot_table(index = 'tariff_plan', values = 'is_missed_call', aggfunc = ('count'))
Out[129]:
is_missed_call
tariff_plan
A 132
B 151
C 127
In [130]:
dataset_in.pivot_table(index = 'tariff_plan', values = 'wait_duration', aggfunc = ('mean', 'median'))
Out[130]:
mean median
tariff_plan
A 87.99 39.00
B 120.44 38.00
C 83.52 38.00

Conclusion:
it seems that the operators have more IN missing calls and also longer duration time of calls from clients with taarif B.

In [ ]:
 

Step 6: Hypothesis.

Are the call durations for outgoing calls similar on all days ?

H0 call duration of outgoing calls is similar between days.
H1 call duration of outgoing calls is different between days.

In [ ]:
 

are the the weekend days similar to other days ?

weekend days:

In [131]:
dataset_5_6 = dataset[dataset['dayofweek_date'] >= 5]
#dataset_5_6.head()

other days:

In [132]:
dataset_0_4 = dataset[dataset['dayofweek_date'] <= 4]
#dataset_0_4.head()
In [133]:
get_list_df(dataset_5_6, dataset_0_4, 'call_duration')
p-value:  0.019842807809768256
We reject the null hypothesis

Conclusions:
Call duration is different between weekend days and other days.

In [ ]:
 

are the weekend days similar to weekend days ?

In [134]:
dataset_5 = dataset[dataset['dayofweek_date'] == 5]
dataset_6 = dataset[dataset['dayofweek_date'] == 6]
In [135]:
get_list_df(dataset_5, dataset_6, 'call_duration')
p-value:  0.1669506426712043
We can't reject the null hypothesis

Conclusions:
Call duration between weekend days is similar.

In [ ]:
 

are the other days similar to other days ?

In [136]:
n = 4
for i in range(n-1): 
    for j in range(i, n-1): 
        print('*******************')
        print('for days: {} and {}'.format(i+1, j+2))
        dataset_i = dataset[dataset['dayofweek_date'] == i]
        dataset_j = dataset[dataset['dayofweek_date'] == j]
        get_list_df(dataset_i, dataset_j, 'call_duration')
*******************
for days: 1 and 2
p-value:  1.0
We can't reject the null hypothesis
*******************
for days: 1 and 3
p-value:  0.4650855892592898
We can't reject the null hypothesis
*******************
for days: 1 and 4
p-value:  0.8831624072190745
We can't reject the null hypothesis
*******************
for days: 2 and 3
p-value:  1.0
We can't reject the null hypothesis
*******************
for days: 2 and 4
p-value:  0.39386569907618174
We can't reject the null hypothesis
*******************
for days: 3 and 4
p-value:  1.0
We can't reject the null hypothesis

Conclusions:
Call duration between other days is similar.

Is there a statistical difference of outgping calls duration between operators with high score – above 3 including.
And with those who have a low score - below 3.

H0 call duration of outgoing calls is similar between groups of higt and low grade.
H1 call duration of outgoing calls is different between groups of higt and low grade.

In [137]:
for col in rate_month.columns: 
    if col == 'operator_id':
        continue   

    operator_high = rate_month[rate_month[col] >= 3]['operator_id']
    operator_low = rate_month[rate_month[col] < 3]['operator_id']

    call_duration_high = dataset[(dataset['month_date'] == col) & \
                                 (dataset['operator_id'].isin(operator_high))\
                                 & (dataset['is_missed_call'] == False)]['call_duration']
    call_duration_low = dataset[(dataset['month_date'] == col) & \
                                (dataset['operator_id'].isin(operator_low)) &\
                                (dataset['is_missed_call'] == False)]['call_duration']

    print('*********************')
    print('for month:', col)
        
    run_hypotheses(call_duration_high, call_duration_low)
*********************
for month: 8
p-value:  8.580028663609597e-20
We reject the null hypothesis
*********************
for month: 9
p-value:  6.381587882336673e-31
We reject the null hypothesis
*********************
for month: 10
p-value:  6.236035819285214e-12
We reject the null hypothesis
*********************
for month: 11
p-value:  0.52507798780386
We can't reject the null hypothesis

Conclusion:
out of 4 months, at 3 of them there is a statistical difference between the duration of the outgoing calls between operators with high and low grade.

Step 7: final conclusions.

We have managed to calculate using a grading system a score for every operator to determine least effectiveness.
The higher the score, the less effective is the operator.
Max score is 12.
We have creates a list on a monthly basis, and a list for the full term, with the top least effective.

We have also concluded that the call duration of the outgoing calls is similar between the other days.
And also similar between the weekend days.

Yet the duration is different between the weekend days and the other days.

The operators seem to favor calls of clients with taarif A.
And the disfavor clients with taarif B.

Articles.


A list of articles and the questions they helped me answer:

  1. Virtual Phone System: What Is It & How Does It Work?

link: https://www.nextiva.com/blog/what-is-a-virtual-phone-system.html


Why is it a good idea for a business to use virtual phones ?
What are the benefits ?
What are the top features of a virtual phone system?

  1. How Does a Virtual Phone System Work?

link: https://talkroute.com/how-does-a-virtual-phone-system-work/

How Technology has changed our view of telephones ?

3.Benefits of using a Virtual Phone Number for your Business

link: https://tweakyourbiz.com/management/customer-service/virtual-phone-benefits
What are the benefits of using a Virtual Phone Number for you’re a business ?

  1. 5 Ways Businesses Can Improve Telephone Customer Service

link: https://clutch.co/bpo/virtual-assistant/resources/how-improve-telephone-customer-service
How important is it for a clients to have a short wait duration ?
Who important is it to have a short conversation ?

In [ ]: